TTU-WSU-GBAD

VAST 2008 Challenge
Mini Challenge 3:  Cell Phone Calls 

Authors and Affiliations:

William Eberle, Tennessee Technological University, weberle@tntech.edu  [PRIMARY contact]

Lawrence Holder, Washington State University, lholder@wsu.edu

 

Student team: NO

Tool(s):

In order to analyze the Paraiso cell-phone traffic, specifically the Catalano/Vidro social structure, we used the Graph-Based Anomaly Detection (GBAD) tool to focus the visualization on interesting structural anomalies.  Initially created in 2006 as a joint venture between the University of Texas at Arlington and Washington State University, GBAD discovers anomalous instances of structural patterns in data, where the data represents entities, relationships and actions in graph form. Input to GBAD is a labeled graph in which entities are represented by labeled vertices and relationships or actions are represented by labeled edges between entities.  Using the minimum description length (MDL) principle to identify the normative pattern that minimizes the number of bits needed to describe the input graph after being compressed by the pattern, GBAD embodies novel algorithms for identifying the three possible changes to a graph:  modifications, insertions and deletions.  Each algorithm discovers those substructures that match the closest to the normative pattern without matching exactly.  As a result, GBAD is looking for those activities that appear to match normal patterns, but in fact are structurally different.  GBAD is a Unix-based tool written in C, and uses the SUBDUE graph-based data mining system (www.subdue.org) as the engine for discovering the normative pattern in a graph.  GBAD was developed by William Eberle (http://users.csc.tntech.edu/~weberle/) and Lawrence Holder (http://www.eecs.wsu.edu/~holder/).

 

Two Page Summary:   YES

 

Analyzing Catalano/Vidro Social Structure Using GBAD


ANSWERS:


Phone-1: What is the Catalano/Vidro social network, as reflected in the cell phone call data, at the end of the time period  

 

   PhoneNodes.txt

   PhoneLinks.txt

 


Phone-2  Characterize the changes in the Catalano/Vidro social structure over the ten day period.

 

Detailed Answer:

 

Our Graph-Based Anomaly Detection (GBAD) system takes a graph-representation of data and applies three algorithms that analyze the graph for structural anomalies.  Each of these algorithms is applied after the normative graph structure has been discovered.  It is our hypothesis that such a system can discover knowledge in a graph representation of the Cell Phone Calls (social network) data that will (1) show the normal social structure of the Catalano’s and Vidro’s, (2) show when the social structure has been broken, and (3) show anomalies in the social behavior of this group, indicating possible breaches to their “inner circle”. 

 

In order to apply GBAD to the problem data set, we first had to convert the cell-phone calls to a graph representation.  Initially, we created a graph from all of the data that was provided, as shown in Figure 1.

 

Figure 1  Graph representation of cell-phone traffic.

 

In this graph representation, each call consists of a vertex for each PERSON found in the original data set.  Each time a PERSON is a SENDER or RECEIVER of a call, a link is made between each PERSON vertex, with a corresponding intermediary CALL vertex.  Each CALL vertex, unique to a placed call, contains three properties:  the DAY the call was made, the DURATION of the call, and the LOCATION of the call origination cell tower.

 

However, since the goal of this objective was to characterize the changes in the social structure of the Catalano/Vidro network, we then decided to forego the DAY, DURATION and LOCATION links (and their associated vertices), and focus on just the social interactions between the pertinent parties.  Thus, we decided to just create, in essence, a social network that indicated for a particular day, who called who.  To start with, based upon all of the information that was provided, we made the following assumptions about this particular data set:

 

  • The person with an ID of 200 is Ferdinando Catalano.
  • Anyone that Ferdinando Catalano calls (or that calls him) is in his “inner circle”.
  • The person with an ID of 5 is Estaban Catalano, Ferdinando’s brother, as he is called the most frequently.
  • The person with an ID of 1 is David Vidro, as he talks the most to the others that Ferdinando talks to.

 

Starting with these simple assumptions, and a graph that consisted of vertices for each unique ID with links between the vertices if there was a conversation on that day, we were able to create a simple visualization of Ferdinando’s “inner circle” social network structure (or Catalano/Vidro social structure) over the 10 days that data was generated (Figure 2).

 

 

Figure 2  Graph representation of Ferdinando's inner circle.

 

Figure 2 (and subsequent figures) was rendered using AT&T’s graph visualization program (http://www.graphviz.org/).  In particular, the GraphViz tool we used is called dotty.  Dotty takes a graph file, specified in the GraphViz format, and renders the graph in a visual format (like what is shown in Figure 2).  The expected input format to dotty consists of vertex numbers, and their corresponding labels and colors, and edges that specify direction, label, arrowhead and color.  One of the tools available with GBAD/SUBDUE called graph2dot can convert the GBAD/SUBDUE graph input file into the GraphViz dotty format.  So, we took the graph input file, ran graph2dot on the input file, which produced a dotty version of the graph, which we then input into the Windows version of dotty.  This “visualization” (shown in Figure 2) shows the graph structure of interactions between people in 200’s (i.e., Ferdinando Catalano’s) inner circle (i.e., 200, 1, 2, 3, 5, 97 and 137).  Each of the disconnected subgraphs represents the calling-history between the targeted group members for a single day.  Thus, you will notice that there are only 9 substructures in the graph.  This is due to the fact that on Day 8, nobody in 200’s inner circle talked to each other.  In other words, there were no calls between 1, 2, 3, 5, 97, 137 or 200 on that day.

 

Once we had generated the graph, we ran the graph through the GBAD system, applying all three algorithms to see if any (or all) of them would discover any anomalous patterns in the social structure.  GBAD is a Unix command-line tool with various parameters that control the search space.  Through command-line parameters, a user can specify various performance tuning criteria such as pruning, the size of the search beam, and the number of substructures to be kept on the candidate list.  Since the cell-phone traffic for this mini-challenge only consists of 400 call records, we only customized two parameters (leaving the other parametric settings to their default values):  the number of substructures kept on the candidate list was set to 200, and the number of GBAD iterations was set to 2.  Both of these values are somewhat arbitrary, and were chosen due to the size of the graph and the size of the normative pattern.  The first parameter indicates that we wanted to keep only the top 200 best substructures, as the diversity in substructure patterns is somewhat sparse; and only two iterations were needed because each phone call is only composed of a few attributes, for which most of the time these attributes were included in the normative pattern.

 

Figure 3 shows the original graph, as well as the normative pattern and anomalous patterns.

 

 

Figure 3  Ferdinando's inner circle with associated normative pattern and anomalies.

 

In Figure 3, the structure in green indicates the normative pattern that was discovered in this graph.  The substructure in red indicates an anomaly that was discovered by the GBAD-P algorithm, which analyzes a graph for anomalous extensions to the normative pattern.  In this case, the fact that 5 called 97 was anomalous, when compared to other instances of what was the normal social structure.  The substructure in orange indicates an anomaly that was discovered by the GBAD-MPS algorithm, which analyzes a graph for missing structure.  In this case, the fact that 200 did not talk to 3 on that day is considered anomalous.  Again, we were able to visualize these results by manually editing the dotty input file and changing the “color” values based upon the results that were returned by the GBAD system.  GBAD returns the anomalies in an ASCII text format, and does not render a graphical visualization.  But, as is shown, a simple visualization of these results using the GraphViz dotty utility shows the usefulness of this approach. 

 

Looking further at this visualization of Ferdinando Catalano’s calling-history, we are able to make several interesting observations about his social network: 

  • Catalano/Vidro’s “normative” social pattern only occurs on Days 1, 2, 4, 5 and 7.
  • Nobody from the “normative inner circle” (i.e., 1, 2, 3, 5 and 200), communicates with anyone else in the normative circle after Day 7.  Could it be that Ferdinando sent them to the United States at this point?
  • 200 communicates with both 97 and 137 on Day 9, and just 97 on Day 10.
  • 200 is involved in an “inner-circle” conversation on every day (except Day 8).

 

Due to the above simple visualization of this data, and what we know about the situation on Isla del Sueńo, we were able to make these observations.  We also played with several other variants of the graph, including the “directedness” of the graph.  While we chose an undirected graph for all of the results shown above (because we considered a conversation between two people to be a two-way communication), we also looked at a directed version of the graph, where the edge between two vertices was directed going from the person who called to the person who was being called.  When we did that, we noticed that 97 and 137 are never called by 1, 2, 3 and 5 – and they only call 5 and 200.

 

The advantage of this approach in terms of this contest is that we are able to show how a simpler visualization technique, combined with graph mining tools, can help an analyst focus their search for relevant information in what can be a fairly complex network of communications.  Traditional data mining approaches involve the probabilities and distributions of data values, while a graph-based approach such as this can discover differences in data when structure and relationships are represented as nodes and links.

 

While the cell-phone calls were made over a ten-day period, as shown above we chose to represent each day as an individual sub-graph.  In the future, we will investigate the representation of dynamic graphs which would then allow us to indicate a change in the structure over time.